Add dataproc tpcds example notebook by viadea · Pull Request #607 · NVIDIA/cudf-spark-examples

viadea · 2025-11-19T22:24:43Z

Add an example tpcds notebook for GCP dataproc.

Signed-off-by: Hao Zhu <hazhu@hazhu-mlt.client.nvidia.com>

greptile-apps · 2025-11-19T22:27:23Z

Greptile Summary

This PR adds a Dataproc-specific TPC-DS benchmark notebook (TPCDS-SF3K-Dataproc.ipynb) and extends the README with gcloud CLI instructions for spinning up a GPU-enabled Dataproc cluster, running GPU and CPU Spark benchmarks side-by-side, and plotting speedup results.

The notebook follows the same CPU-vs-GPU comparison pattern as the existing Colab notebook but is adapted for Dataproc's pre-configured Spark environment (SparkSession via getOrCreate, JAR pre-loaded via cluster properties).
Several leftovers from the source notebook remain: the scala_version detection cell result is never referenced downstream, sparkmeasure is pip-installed and pre-loaded as a cluster JAR but no Python sparkmeasure APIs are called, from importlib.resources import files is imported but unused, and the appName still reads "NDS Example" instead of a TPC-DS label.
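
Leftovers such as the unused import can be flagged mechanically. The following is a minimal sketch using the standard `ast` module, not part of the PR. The cell text is a made-up stand-in rather than the notebook's actual cell, and the check ignores edge cases such as names re-bound by assignment or referenced only inside strings.

```python
import ast


def unused_imports(source):
    """Return imported names that are never referenced in the given source."""
    tree = ast.parse(source)
    imported = set()
    used = set()
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            for alias in node.names:
                if alias.name == "*":
                    continue  # star imports cannot be checked by name
                # "import a.b" binds "a"; "import x as y" binds "y".
                imported.add((alias.asname or alias.name).split(".")[0])
        elif isinstance(node, ast.Name):
            used.add(node.id)
    return sorted(imported - used)


# Hypothetical cell source for illustration only.
cell_source = """
from importlib.resources import files
import time

start = time.time()
"""
print(unused_imports(cell_source))
```

Running a check like this over each code cell before review would surface dead imports early, though a linter is the more thorough option.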

Confidence Score: 4/5

Safe to merge as an example notebook; all findings are cosmetic or dead-code cleanup that do not affect benchmark correctness.

The benchmark logic itself is sound — GPU/CPU runs are clearly separated, results are merged and plotted correctly, and the cluster setup instructions are complete. The issues found are limited to copy-paste residue: a wrong appName, an unused scala_version detection cell, and sparkmeasure being installed and configured at the cluster level without any actual usage in the notebook.

The notebook TPCDS-SF3K-Dataproc.ipynb has the unused cells and wrong app name worth cleaning up before the example is widely shared.

Important Files Changed

Filename	Overview
examples/SQL+DF-Examples/tpcds/notebooks/TPCDS-SF3K-Dataproc.ipynb	New Jupyter notebook for running TPCDS GPU vs CPU benchmarks on GCP Dataproc; contains a copy-paste error in appName, a dead code cell for scala_version detection, and installs/configures sparkmeasure without ever using it.
examples/SQL+DF-Examples/tpcds/README.md	Adds a Dataproc cluster creation section with gcloud CLI commands and environment variable setup; instructions are clear and include a note to adjust the shuffle manager class per Spark version.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Install packages\ntpcds_pyspark, sparkmeasure, pandas, matplotlib] --> B[Import modules]
    B --> C[Detect Scala version from spark-sql JAR\n⚠️ result unused]
    C --> D[Create SparkSession\nappName='NDS Example' ⚠️]
    D --> E[Verify GPU acceleration\nspark.range + explain]
    E --> F[Init TPCDS\ndata_path=gs://GCS_PATH_TO_TPCDS_DATA/]
    F --> G[Register TPC-DS tables\ntpcds.map_tables]
    G --> H[GPU Run\nspark.rapids.sql.enabled=True\ntpcds.run_TPCDS]
    H --> I[CPU Run\nspark.rapids.sql.enabled=False\ntpcds.run_TPCDS]
    I --> J[Merge results\ncompute speedup]
    J --> K[Plot elapsed time comparison]
    J --> L[Plot speedup factors]
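
The last steps of the flow (merge results, compute speedup) can be sketched in plain Python. This is an illustrative sketch, not the notebook's code: the notebook installs pandas for this step, and the query names and timings below are made-up placeholders, not benchmark results.

```python
# Hypothetical per-query elapsed times in seconds; values are illustrative only.
gpu_times = {"q1": 2.0, "q2": 5.0, "q3": 4.0}
cpu_times = {"q1": 8.0, "q2": 10.0, "q3": 6.0}


def compute_speedup(cpu, gpu):
    """Join the two runs on query name and return the CPU/GPU speedup per query."""
    merged = {}
    for query in sorted(cpu.keys() & gpu.keys()):  # keep queries present in both runs
        merged[query] = {
            "cpu_s": cpu[query],
            "gpu_s": gpu[query],
            "speedup": cpu[query] / gpu[query],  # above 1 means the GPU run was faster
        }
    return merged


results = compute_speedup(cpu_times, gpu_times)
for query, row in results.items():
    print(f"{query}: CPU {row['cpu_s']:.1f}s, GPU {row['gpu_s']:.1f}s, speedup {row['speedup']:.2f}x")
```

Joining on the query name rather than on position keeps the comparison valid if one run skips or fails a query.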

Reviews (1): Last reviewed commit: "Clear a cell output"

greptile-apps

2 files reviewed, 3 comments

greptile-apps · 2025-11-19T22:27:21Z

+    "]\n",
+    "\n",
+    "demo_start = time.time()\n",
+    "tpcds = TPCDS(data_path='gs://gcs_bucket/parquet_sf3k_decimal/', num_runs=1, queries_repeat_times=1, queries=queries)"


syntax: gs://gcs_bucket is a placeholder - should be updated to match the $GCS_BUCKET variable pattern used in the README

Suggested change

"tpcds = TPCDS(data_path='gs://gcs_bucket/parquet_sf3k_decimal/', num_runs=1, queries_repeat_times=1, queries=queries)"

"tpcds = TPCDS(data_path='gs://$GCS_BUCKET/parquet_sf3k_decimal/', num_runs=1, queries_repeat_times=1, queries=queries)"

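One caveat with the suggested string: Python does not expand `$GCS_BUCKET` inside a string literal, so the path would be passed along literally unless something downstream substitutes it. A minimal sketch of building the path from the environment instead, assuming `GCS_BUCKET` is exported where the notebook kernel runs (the helper name `tpcds_data_path` and its error handling are hypothetical, not part of the PR):

```python
import os


def tpcds_data_path(env=os.environ, subdir="parquet_sf3k_decimal"):
    """Build the gs:// data path from the GCS_BUCKET environment variable."""
    bucket = env.get("GCS_BUCKET")
    if not bucket:
        # Fail early rather than handing Spark a path with an unexpanded placeholder.
        raise ValueError("GCS_BUCKET is not set; export it before starting the notebook")
    # Accept either "my-bucket" or "gs://my-bucket/" in the variable.
    if bucket.startswith("gs://"):
        bucket = bucket[len("gs://"):]
    bucket = bucket.strip("/")
    return f"gs://{bucket}/{subdir}/"


# Example with an explicit mapping, so the sketch runs without touching the real environment.
print(tpcds_data_path({"GCS_BUCKET": "example-bucket"}))
```

The returned string could then be passed as `data_path` to `TPCDS(...)` in place of the hard-coded literal.
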
gerashegalov · 2025-11-19T22:42:47Z

+      "text/html": [
+       "\n",
+       "            <div>\n",
+       "                <p><b>SparkSession - hive</b></p>\n",
+       "                \n",
+       "        <div>\n",
+       "            <p><b>SparkContext</b></p>\n",
+       "\n",
+       "            <p><a href=\"http://testbyhao2-ubuntu22-m.c.rapids-spark.internal:46705\">Spark UI</a></p>\n",
+       "\n",
+       "            <dl>\n",
+       "              <dt>Version</dt>\n",
+       "                <dd><code>v3.5.3</code></dd>\n",
+       "              <dt>Master</dt>\n",
+       "                <dd><code>yarn</code></dd>\n",
+       "              <dt>AppName</dt>\n",
+       "                <dd><code>PySparkShell</code></dd>\n",
+       "            </dl>\n",
+       "        </div>\n",
+       "        \n",
+       "            </div>\n",
+       "        "
+      ],


Please clear the notebook output for the PR

viadea · Nov 20, 2025

Sure, will do.

viadea · Nov 20, 2025

Cleared all the output.
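
For reference, clearing outputs amounts to emptying each code cell's `outputs` list and resetting its `execution_count` in the notebook JSON. A minimal standard-library sketch, assuming the nbformat 4 layout; the in-memory sample stands in for the real `.ipynb` file. Where nbconvert is installed, `jupyter nbconvert --clear-output --inplace <notebook>` is the usual command-line route.

```python
import json


def clear_outputs(notebook):
    """Clear code-cell outputs and execution counts in place; return the same dict."""
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return notebook


# Tiny in-memory notebook (nbformat 4 layout) used only for illustration.
sample = {
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title"]},
        {
            "cell_type": "code",
            "execution_count": 3,
            "metadata": {},
            "outputs": [{"output_type": "stream", "name": "stdout", "text": ["hi\n"]}],
            "source": ["print('hi')"],
        },
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 5,
}

cleaned = clear_outputs(sample)
print(json.dumps(cleaned["cells"][1], indent=1))
```

Clearing outputs also removes environment details such as the internal Spark UI hostname visible in the diff above.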

gerashegalov · 2025-11-19T22:44:08Z

Please add a PR description

Signed-off-by: Hao Zhu <hazhu@hazhu-mlt.client.nvidia.com>

greptile-apps

2 files reviewed, no comments

Signed-off-by: Hao Zhu <hazhu@hazhu-mlt.client.nvidia.com>

greptile-apps

2 files reviewed, no comments

gerashegalov · 2025-11-21T00:09:11Z

Per offline conversation, let us try to add knobs for hosted Spark and hosted data so that the original TPC-DS notebook can accommodate these use cases, instead of adding a clone with only a few modifications.

We will gradually expand the README in follow-up PRs to explain how to run this notebook on different cloud providers.

sameerz · 2025-12-02T21:40:07Z

Please add a performance benchmark running on the CPU vs. GPU.

sameerz · 2025-12-08T21:50:45Z

Per offline conversation, let us try to add knobs for hosted Spark and hosted data so that the original TPC-DS notebook can accommodate these use cases, instead of adding a clone with only a few modifications.

We will gradually expand the README in follow-up PRs to explain how to run this notebook on different cloud providers.

Request here is to provide a notebook specific to each environment, so users do not need to make any changes. Make it as simple as possible for the user.

Understand that will create maintenance overhead.

gerashegalov · 2025-12-09T19:40:40Z

Request here is to provide a notebook specific to each environment, so users do not need to make any changes. Make it as simple as possible for the user.

Understand that will create maintenance overhead.

The PR already assumes CSP-specific launch instructions, as the proposed README changes show. I bet the default environment already exposes enough specifics, even without those instructions, to support minor CSP-specific logic in the notebook. If not, that can be part of the command documented for the user anyway.
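
The "knobs" idea discussed above could look roughly like the sketch below: one notebook reads a few settings and adapts, instead of one clone per provider. Every name here (`TPCDS_DATA_PATH`, `TPCDS_HOSTED_SPARK`, `NotebookSettings`, the local default path) is hypothetical and not taken from the PR or the existing notebook.

```python
import os
from dataclasses import dataclass


@dataclass
class NotebookSettings:
    data_path: str      # where the TPC-DS Parquet data lives
    hosted_spark: bool  # True when the platform already provides a SparkSession


def load_settings(env=os.environ):
    """Derive notebook settings from environment variables (all names hypothetical)."""
    data_path = env.get("TPCDS_DATA_PATH", "/tmp/tpcds_data/")  # assumed local default
    flag = env.get("TPCDS_HOSTED_SPARK", "false").strip().lower()
    return NotebookSettings(data_path=data_path, hosted_spark=flag in ("1", "true", "yes"))


# Default (local) settings versus a Dataproc-style launch with both knobs set.
local = load_settings({})
dataproc = load_settings({
    "TPCDS_DATA_PATH": "gs://example-bucket/parquet_sf3k_decimal/",
    "TPCDS_HOSTED_SPARK": "true",
})
print(local)
print(dataproc)
```

With this shape, the provider-specific part stays in the documented launch command that sets the variables, which is the trade-off the thread is weighing against fully separate per-environment notebooks.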

nvauto · 2026-01-26T02:09:23Z

NOTE: release/26.02 has been created from main. Please retarget your PR to release/26.02 if it should be included in the release.

nvauto · 2026-03-30T06:29:58Z

NOTE: release/26.04 has been created from main. Please retarget your PR to release/26.04 if it should be included in the release.

Add dataproc tpcds example notebook

85db578

Signed-off-by: Hao Zhu <hazhu@hazhu-mlt.client.nvidia.com>

viadea requested a review from gerashegalov November 19, 2025 22:24

greptile-apps Bot reviewed Nov 19, 2025

gerashegalov reviewed Nov 19, 2025

Cleared the notebook output and did some minor change on README.

7eac061

Signed-off-by: Hao Zhu <hazhu@hazhu-mlt.client.nvidia.com>

viadea requested a review from gerashegalov November 20, 2025 23:17

greptile-apps Bot reviewed Nov 20, 2025

Hao Zhu added 2 commits November 20, 2025 15:28

Modify some format issue for README

6d9aa88

Signed-off-by: Hao Zhu <hazhu@hazhu-mlt.client.nvidia.com>

Clear a cell output

19d0c6e

Signed-off-by: Hao Zhu <hazhu@hazhu-mlt.client.nvidia.com>

greptile-apps Bot reviewed Nov 20, 2025

4 participants
